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Regulatory elements in the 3' untranslated regions (UTRs) of eukaryotic mRNAs influence mRNA localization, translation, 
and stability. 3'-UTR length is determined by the location at which mRNAs are cleaved and polyadenylated. The use 
of alternative polyadenylation sites is common, and can be regulated in different situations. I present a new method 
to identify cleavage and polyadenylation sites (CSs) at the genome-wide level. The approach is strand-specific, avoids 
RNA enzymatic modification steps that can introduce sequence-specific biases, and uses unique molecular identifiers 
to ensure that all identified CS originates from individual RNA molecules. I applied this method to create the first 
comprehensive genome-wide map of polyadenylation sites of the fission yeast Schizosaccharomyces pombe, comprising 
the analysis of 2 021 000 individual mRNAs that defined 8883 CSs. CSs were identified for 90% of coding genes and 50% of 
ncRNAs. Alternative polyadenylation was prevalent in both groups, with 41% and 45% of all detected genes, respectively, 
displaying more than one CS. The specificity of the cleavage reaction was gene-specific, resulting in highly variable 
levels of heterogeneity in 3'-UTR lengths. Finally, I show that for both coding and non-coding genes, the most common 
regulatory motif associated with CSs in fission yeast is the canonical human AAUAAA sequence. 



Introduction 

Most eukaryotic mRNAs carry a non-encoded tail of adenosine 
residues at their 3' end [poly(A) tail]. Poly(A) tails are added in 
two separate steps: in the first one, nascent mRNAs are cleaved; 
subsequently, the polyA tail is added to the cleaved substrate. 
Both reactions are coupled and require the function of a com- 
plex multisubunit machinery, which is recruited cotranslation- 
ally to pre-mRNAs. 1 Cleavage and polyadenylation in mammals 
are regulated by three major cw-acting sequences: 2 the AAUAAA 
sequence or a closely related motif, located 15-30 nucleotides 
upstream of the CS, a uridine-rich upstream sequence element 
(USE), situated 0-20 nucleotides upstream of the AAUAAA 
motif, and a downstream sequence element (DSE), found 0-20 
nucleotides downstream of the cleavage site. In budding yeast, 
the AAUAAA motif is less conserved, the DSE is not present, 
and there is a poorly conserved efficiency element (EE) upstream 
of AAUAAA. 3 

Most mRNAs have untranslated regions after their coding 
sequences (3' UTRs). 2 3' UTRs contain regulatory elements that 
are recognized by trans factors such as RNA-binding proteins and 
micro-RNAs, which regulate mRNA localization, stability, and 
translation. The position of the cleavage and polyadenylation site 
(CS) determines the length of the 3' UTR, and, thus, the regula- 
tory elements that will be included in the mature mRNA. The 

Correspondence to: Juan Mata; Email: jm593@cam.ac.uk 
Submitted: 05/28/2013; Revised: 07/02/2013; Accepted: 07/15/2013 
http://dx.doi.org/! 0.41 61/rna.25758 



use of multiple polyadenylation sites for a specific mRNA (alter- 
native polyadenylation, or APA) is widespread, 4 and appears to 
be regulated in tissue- and developmental-specific manners. For 
example, there is a switch to the use of proximal CSs during cell 
proliferation, 5 early development, 6,7 and cancer. 8 This leads to the 
production of shorter mRNA isoforms, which lack key binding 
sites for miRNAs and are therefore differentially regulated. 

CSs can be identified from standard RNA-seq data by exam- 
ining reads that span the boundary between the CS and the 
poly(A) tail. However, this approach is very inefficient because 
only a very small fraction of reads the can be used. Recently, 
a number of studies have employed next-generation sequenc- 
ing-based approaches specifically targeted to the identification 
of CSs. These methods have been applied to budding yeast, 
worms, mammals, and plants (reviewed in ref. 9). I present here 
a novel technique for the systematic mapping and quantification 
of CSs. The approach is based on the isolation and sequencing 
of mRNA ends. It is easy to implement, strand-specific, avoids 
RNA enzymatic modifications that can cause sequence-specific 
biases, and uses unique molecular identifiers (UMIs) to elimi- 
nate PCR amplification artifacts. I used this approach to create 
the first genome-wide map of polyadenylation sites in the fission 
yeast Schizosaccharomyces pombe. The results revealed widespread 
use of APA in both coding genes and long non-coding RNAs, as 
well as highly variable levels of specificity in the choice of CSs. 
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Figure 1. Outline of 3PC protocol. Total RNA is chemically fragmented and purified using 
oligo(dT) magnetic beads. Poly(A)-containing fragments are used as substrates for reverse 
transcription, and the resulting cDNAs are circularized and used as templates for PCR ampli- 
fication. Poly(A) and poly(T) sequences are indicated. "R" represents the random sequence 
of the unique molecular identifier, and "B" a multiplexing barcode. Sequences from the RT 
primer that are used for PCR amplification are represented in cyan and green. See Materials 
and Methods for more details. 



In addition, they showed that the most common polyadenylation 
signal used in fission yeast is the canonical human AAUAAA. 

Results 

A method for high-throughput mapping of cleavage and poly- 
adenylation sites. Below I refer to the method as 3PC (for 3' 
Poly (A) site mapping using cDNA Circularization). The strat- 
egy is summarized in Figure 1. After chemical fragmentation 
of total RNA, poly (A) -containing RNAs are purified using 
oligo(dT) magnetic beads. Isolated RNA fragments are reverse- 
transcribed using a primer (RT primer, Table SI) that contains 
an anchored oligo(dT) sequence 10 and two primer binding sites 



for nested PCR amplification. The resulting 
cDNAs are circularized and used as templates 
for PCR amplification. A spacer that cannot 
be traversed by DNA polymerases is located 
between both primer binding sites and allows 
the use of the circularized cDNA as a template 
without the need for linearization." The nested 
PCR primers add Illumina-specific adaptors 
compatible with single or paired-end sequenc- 
ing. 3PC is strand-specific, an essential feature 
for the analysis of the highly compact S. pombe 
genome, where overlapping transcripts are very 
common. 12 The circularization strategy allows 
the bypass of RNA ligations, which are known 
to be sequence-dependent." The RT prim- 
ers also contain a random sequence that serves 
as a unique molecular identifier (UMI). 13,14 
Therefore, each original cDNA is individually 
tagged and can be distinguished after PCR 
amplification. Single-end sequencing is per- 
formed on an Illumina platform starting from 
the side containing the oligo(dT) sequence, 
thus ensuring that the junction between the 
mRNA and the poly(A) sequence is reached 
in every clone. The data are analyzed using a 
straightforward and stringent bioinformatic 
pipeline (Fig. SI). First, identical sequences 
containing the same UMI are discarded, as 
they are derived from a single RNA fragment. 
This step removes PCR amplification artifacts 
and allows better quantification of poly(A) site 
usage. Second, UMIs and oligo(dT) sequences 
are removed from the reads. Third, reads are 
mapped to the S. pombe genome using the 
Bowtie aligner. 15 Finally, to remove sequences 
that may originate from internal poly(A) sites, 
reads that map upstream of adenosyl stretches 
are discarded (see Materials and Methods for 
details). I applied this strategy to map cleav- 
age and polyadenylation sites in exponentially 
growing S. pombe cells. Three libraries were 
generated from independent biological sam- 
ples. The libraries were sequenced on either an 
Illumina Genome Analyzer II or a HiSeq 2000 platform using 
standard Illumina primers, producing a total of 15,626,208 reads 
(Table S2). 45.6% of the sequences were identified as redundant 
using UMIs, leaving 8,489,700 unique reads. After all process- 
ing steps (Fig. SI), 2,021,000 reads were mapped to unique loca- 
tions in the S. pombe genome (Table S2). As a result of the use of 
UMIs, each of these reads corresponds to the CS of an individual 
mRNA molecule. To monitor the reproducibility of the proto- 
col, the number of reads mapping to the annotated 3' UTR of 
every fission yeast gene was quantified for each of the three inde- 
pendent experiments. The data showed very high reproducibil- 
ity (Fig. S2), with all pair-wise Pearson correlation coefficients 
above 0.9. To validate the accuracy of the approach, I compared 
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these data with 30 published CSs from three different labora- 
tories mapped using low-throughput methods (Fig. S3). 16 " 18 In 
every case, CSs mapped using the high-throughput approach 
overlapped exactly or were within 2 nucleotides of the published 
ones, demonstrating the ability of this strategy to determine CSs 
at high resolution. 

Cleavage and polyadenylation site usage in S. pombe. 99.7% 
of the mapped reads corresponded to annotated genes or non- 
coding RNAs (Table 1). As expected, there was a very strong bias 
toward 3' UTRs (95.8% of all reads). The data provide a compre- 
hensive survey of the CS landscape of S. pombe with an average 
of 360 reads per 3' UTR, and 90% of 3' UTRs with 10 reads 
or more. Genes for which no CS was detected were enriched in 
meiotically induced genes, 19 in agreement with the low expression 
levels of these genes in vegetative cells. 20 Coverage of ncRNAs 
was lower, with a mean number of reads of 34 and only 15% of 
genes with 10 reads or more. This is consistent with previous 
data showing that ncRNAs tend to be expressed at low levels in 
S. pombe. 2 ' 

Visual inspection of the data revealed variable numbers of 
CSs. For example, the actl gene displays three major CSs, while 
the adhl gene contains a single one (Fig. 2A-C). In addition, 
most CSs displayed some degree of microheterogeneity (Fig. 2B), 
in which CSs did not map to a single nucleotide (although they 
tended to cluster in defined regions of the 3' UTR). A simi- 
lar phenomenon has been reported in other eukaryotes, and is 
thought to be caused by the imprecise nature of the cleavage reac- 
tion (ref. 4 and references therein). 

The amount of CS heterogeneity varied widely among differ- 
ent genes (Fig. 3). For instance, the two genes in Figure 3A and B 
are expressed at similar levels and have UTRs of almost identical 
lengths, but display very different CS profiles. To quantify this 
effect, a heterogeneity score was defined as the minimal number 
of CS positions in a given region required to account for at least 
90% of all observed CSs (Fig. 3C). The average number for all 
3' UTRs was 13.7, with a standard deviation of 6.2. The genes 
displayed in Figure 3A and B had scores of 3 and 31, respectively. 
The heterogeneity score did not correlate with expression levels 
(Pearson correlation -0.04), although it showed a weak correla- 
tion with 3' UTR length (Pearson correlation 0.43). ncRNAs 
had a slightly higher score (mean 16.6, standard deviation 8.0). 
Although microheterogeneity is a well-known phenomenon, the 
reasons for the specific behavior of different CSs are not clear. 

To quantify APA, neighboring CSs were grouped into 
clusters (see Materials and Methods). The number of identi- 
fied clusters were 7,253 for 3' UTRs (Data Sets SI and S2), 
1,277 for ncRNAs (Data Sets S3 and S4), and 353 for CDSs 
(Data Sets S5 and S6). Most reads could be assigned to clusters 
(Fig. 4A), which had an average length of 55.4 nucleotides. The 
median distance between the edges of adjacent clusters was 73 
nucleotides, and the mean separation was 160, strongly suggest- 
ing that the clusters represent independent CSs. The position of 
the 3' UTR CS with most reads within a cluster (peak) was used 
to estimate the length of the 3' UTR (Fig. 4B). Based on these 
data, the median length of a 3' UTR in S. pombe is 203 nucleo- 
tides, and the mean is 284. This is close to the medians of 166 



Table 1. Distribution of CSs among different genomic features 



Poly(A) site mapping 


Reads 


Fraction 


Non-coding RNAs 


47594 


2.4 


CDS 


18725 


0.9 


3'-UTR 


1 935 301 


95.8 


5-UTR 


13131 


0.6 


Others 


6249 


0.3 


Total 


2021 000 


100 



The number of reads mapped to non-coding RNAs, coding sequences 
(CDS), 5' UTRs and 3' UTRs are presented. "Others" represent reads 
mapping to intergenic sequences and introns, as well as reads mapping 
to overlapping features (for example, the 3'-UTR and the 5'-UTR of two 
tandem genes). 



nucleotides of the budding yeast Saccharomyces cerevisiae 12 and 
the 130 of Caenorhabditis elegansJ 

The positions of 3' UTR CSs, which define UTR lengths, 
were compared with the lengths of annotated UTRs (Fig. S4 
and Data Set S7). The comparison was performed in two ways: 
first, by using the most distal single CS identified by 3PC; sec- 
ond, by considering the peak of the most distal cluster. In the 
first case, the correlation was very high (Pearson correlation = 
0.94), although the UTRs defined by 3PC tended to be lon- 
ger (Fig. S4A). In the second case, the correlation was poorer 
(Pearson = 0.64, Fig. S4B). This is consistent with the fact that 
annotated UTRs are usually defined by the longest observed 
transcript, while the position of the peak of the cluster represents 
the length of a "typical" UTR. 

Analysis of the data for 3' UTRs also revealed a high preva- 
lence of APA, with the number of clusters ranging between one 
and five, and a mean number of 1.54 clusters per gene (Fig. 4C; 
Data Set S2). 41% of all detected S. pombe 3' UTRs contained 
more than one cluster. In ncRNAs, there was a similar trend to 
the presence of multiple CSs, with 45% of all detected ncRNAs 
having more than one cluster, and a mean of 1.63 clusters per 
gene (Fig. S5 and Data Set S4). These results demonstrate that 
APA is widespread in S. pombe. The extent of APA is comparable 
to that of multicellular eukaryotes. For example, 47% of genes 
expressed in HeLa cells display APA, with an average of 1.9 iso- 
forms per gene. 23 

Analysis of APA in several organisms has revealed preferences 
for the use of proximal or distal sites, which are often dynami- 
cally regulated. 5 " 8 To investigate if S. pombe displays a positional 
bias in the use of CS, I concentrated on those genes with exactly 
two CSs. The distributions of the number of reads that map to 
the proximal and the distal sites for each 3' UTR were compared 
with each other (Fig. S6). The distributions for both sites were 
very similar, demonstrating that S. pombe vegetative cells do not 
show a global positional preference in the use of CSs. 

A small number of CSs mapped within coding sequences (see 
Fig. 2D for an example). In contrast to 3' UTRs and ncRNAs, 
the number of CSs per gene within a coding sequence was almost 
always one (Fig. S5 and Data Sets S5 and S6), with a mean of 
1.14. These mRNAs tended to have additional CSs in the 3' UTR 
(93.0%, with an average of 1.42). CSs within coding sequences 
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Figure 2. Examples of polyadenylation sites. The y-axes show the number of reads and x-axes the position of the read along the gene. Only the 
nucleotide immediately before the poly(A) tail (cleavage site) is plotted. Grey boxes represent coding sequences and the arrows show the direction 
of transcription. Introns are represented by broken lines. (A) actl, which displays three major CSs in its 3' UTR. (B) High-resolution view of the first 
cluster of CSs in the actl 3' UTR. (C) adhl, which contains a single CS in its 3' UTR. (D) SPBP4H10.15, which has major CSs at both the coding sequence 
and the 3' UTR. 



would generate mRNAs without stop codons, which are likely to 
be degraded by the non-stop decay pathway. 24 

Cis-sequences regulating cleavage and polyadenylation. To 
identify rw-regulatory regions, the sequence distribution in the 
regions around CSs was examined (Fig. 5A). For this purpose, 
the peak of each CS cluster was used. A very strong enrichment in 
adenosines was present in nucleotides 1 and 2 after the CS. Note 
that the drop in adenosines immediately before the CS is due 
to the way in which the sequences are processed (see Materials 
and Methods for details), as it is not possible to unambiguously 
map cases in which cleavage occurs after an adenosine. In addi- 
tion, there was an enrichment in adenosines upstream of the 
CS, extending approximately between -29 and -13, and with 
a peak at position -20. Finally, uridines were enriched before 
and around the CS, showing peaks at positions -8 and 1. A very 
similar pattern was found for ncRNAs (Fig. S7A). Comparable 
patterns have been observed in budding yeast and multicellular 

7 10 

organisms. ' • 

I then searched for over-represented sequence motifs in the 
regions surrounding the CSs. As above, the peaks of each cluster 
were chosen for this analysis. For upstream motifs, the sequences 
between -5 and -50 were scanned. The search revealed 40 
enriched hexamers with compositions high in adenosines and/or 
uridines (Table S3). The most abundant hexamer was the canoni- 
cal human AAUAAA (present in 20.1% of all CSs), while several 
other motifs had sequences that partly overlapped with it. The 
majority of A-rich motifs were located around positions -23 to -26 
(Fig. 5B; Table S3). For example, AAUAAA peaked at position 
-24. By contrast, some U-rich motifs, such as UUUUUU, were 
located closer to the CS (position -14). This is consistent with the 



overall enrichment in U-rich sequences observed at this distance 
from the CS (Fig. 5A). The specific location of these motifs sug- 
gests that they may be of functional importance. Examination of 
ncRNAs revealed related motifs (Table S4), with AAUAAA hav- 
ing the smallest P value and being the most abundant (25.1%). 
AAUAAA was also located at a similar position to that of 3' UTRs, 
peaking at -22 from the CS (Fig. S7B). These results suggest that 
the signals controlling the polyadenylation of coding and non-cod- 
ing genes are similar. By contrast, no over-represented sequences 
were detected downstream of the CS. This was unexpected, given 
that elements located after CSs have been shown to be important 
for efficient cleavage and polyadenylation in S. pombe. 26 

Discussion 

I present here a new method to map cleavage and polyadenylation 
sites and its application to the fission yeast S. pombe. 3PC is based 
on the use of an oligodT primer to enrich mRNA ends, followed 
by sequencing of the library across the oligo(dT) sequence. The 
protocol is straightforward, highly reproducible, strand-specific, 
compatible with multiplexing, and avoids the use of enzymatic 
modifications of RNA, especially RNA ligations, which suffer 
from sequence biases. 11 Moreover, 3PC uses UMIs to avoid PCR 
amplification artifacts. A number of similar protocols have been 
published recently (see ref. 27 for a comparison), but none of 
them involves the use of UMIs. 

Current protocols for the preparation of libraries for Illumina 
sequencing involve PCR amplification. This can lead to biases if 
the efficiency of amplification of individual clones is different. 
Moreover, random sampling of clones within the library means 



1410 



RNA Biology 



Volume 10 Issue 8 



B 



CD 

E 



that the number of times a clone is sequenced may not 
reflect its original abundance in the library. A key advan- 
tage of 3PC is the use of UMIs, which ensures that every 
sequenced poly(A) site corresponds to that of an individual 
RNA molecule, thus providing more accurate quantifica- 
tion of relative CS usage. In addition, given the number 
of sequence reads and the fraction of unique sequenced 
clones (obtained using UMIs), it is possible to estimate 
the complexity of the original library (Fig. S8). This 
information can be used to guide future experiments: for 
example, if the complexity of a library is low, additional 
sequencing will provide little new information, and efforts 
should concentrate on generating a new library. 

3PC requires sequencing through the oligo(dT) 
sequence. This ensures that every read reaches a CS site, 
but can lead to loss of accuracy and efficiency during 
sequencing. 27 A number of solutions have been pro- 
posed to this problem, including filling in the oligo(dA) 
sequencing with thymidines immediately before 
sequencing, 27 and the use of custom sequencing prim- 
ers. 28 These procedures are incompatible with the use of 
UMIs as implemented here, which requires that they be 
located 5' relative to the oligo(dT) in the RT oligo [and 
thus that they be sequenced before the poly(A) tail] . To 
avoid problems caused by accuracy loss, a very strin- 
gent analysis protocol was implemented, in which only 
reads that contain exactly the expected number of Ts 
in their sequence followed by the expected anchor were 
retained (see Materials and Methods). This resulted in 
the removal of -38% of sequences. As using the meth- 
ods discussed above would involve losing the ability to 
employ UMIs, and given that with current technologies 
sequencing depth is rarely limiting, I consider this lower 
efficiency is acceptable. 

This is the first genome-wide analysis of CS usage 
in the fission yeast S. pombe. Over 2 million individual 
non-redundant CSs were mapped, identifying 8883 
major CSs. This resulted in good coverage for more 
than 90% of coding sequences and 15% of annotated 
ncRNAs. Although 3' UTRs have been annotated in 
S. pombe, 29 ' 32 the data were based on standard RNA-seq 
experiments and could not distinguish between differ- 
ent isoforms generated by APA. 

A very interesting finding is the widespread use of 
APA in fission yeast. Even in vegetative conditions, it appears 
that many mRNAs exist as different isoforms, with the length 
of their UTRs differing in some cases by hundreds of nucleo- 
tides (as an example, see actl, a highly abundant mRNA, Fig. 2). 
APA can affect mRNA fate; for example, there are two forms of 
the brain-derived neurotrophic factor BDNF that differ in the 
lengths of their 3' UTRs that differentially localized and trans- 
lated (reviewed in ref. 33). It has also been shown that longer 
UTRs can have a negative effect on gene expression through the 
increased number of miRNA binding sites. 8 However, in most 
cases, the physiological importance of the existence of multiple 
forms, especially in unicellular eukaryotes, is unknown. 
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Figure 3. Analysis of cleavage site heterogeneity. In (A and B) the y-axes show 
the number of reads and the x-axes the position of the read along the gene. Only 
the nucleotide immediately before the poly(A) tail (cleavage site) is plotted. Only 
the 3' UTRs are depicted, with the arrows showing the direction of transcription. 
(A) SPBC365.16, which has a heterogeneity score of 3. (B) SPA1071.01c, which has 
a heterogeneity score of 31. (C) Histogram displaying the heterogeneity score for 
all 3' UTRs. 



The comprehensive data set presented in this manuscript will 
be of great value for S. pombe researchers, by assisting in the anno- 
tation of the fission yeast genome, and as the basis for hypothesis- 
driven experiments into gene expression control in this model 
organism. In addition, I expect that these data will provide the 
foundation for comparative studies of among various organisms. 

Materials and Methods 

Fission yeast methods. Standard fission yeast methods were 
used for all experiments. 972 h-cells were grown in Edinburgh 
Minimal Medium (EMM) at 32 °C. 
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Figure 4. Analysis of poly(A) site usage in 3' UTRs. The y-axes show the 
number of genes in each category. (A) Histogram showing the fraction 
of reads that could be assigned to a cluster within a 3' UTR. (B) Histo- 
gram of 3' UTR lengths, calculated using the peak of each cluster. (C) 
Histogram displaying the number of identified CSs per 3' UTR. 



Library preparation and sequencing. All primers used are 
described in Table SI and the structure of the RT primer is pre- 
sented in Figure S9. Total RNA was prepared using a hot-phe- 
nol extraction protocol. 100 |xg of total mRNA were incubated 
with 1/9 volumes of RNA fragmentation reagents (Ambion, Life 
Technologies) at 70 °C for 4.5 min. The reaction was stopped by 
adding 1/9 volumes of stop solution (Ambion, Life Technologies) 
and transferring the samples to 4 °C. Polyadenylated frag- 
ments were purified using oligo(dT) magnetic beads (Life 
Technologies). 200 |xl of beads were washed in binding buffer 
(20 mM TRIS-HC1 pH 7.0, 2 mM EDTA, 1 M LiCl). 50 |xl of 
binding buffer was added to 50 |xl of fragmented RNA, and the 
mix was incubated at 65 °C for 2 min and then transferred to ice. 
The RNA was then added to the beads, and incubated at room 
temperature for 5 min. The beads were washed twice in wash 
buffer (10 mM TRIS-HC1 pH 7.0, 1 mM EDTA, 0.15 M LiCl). 
RNA was eluted by resuspending the beads in 22 u,l of 10 mM 
Tris pH 8.0 and incubating at 80 °C for 2 min. The remaining 
RNA in lysis buffer was incubated with the beads once more 
and washed as described above, and the two eluates combined. 
For reverse transcription, 11 (xl of purified RNA fragments were 
mixed with 1 |xl of dNTPs (10 mM each) and 1 ixl of 50 U.M 
RT primer, incubated at 65 °C for 5 min and transferred to ice. 
Four u.1 of 5X FFS buffer (Invitrogen, Life Technologies), 1 ul 
of SuperaseIN (Ambion, Life Technologies), 1 (Jtl of 0.1 M DTT, 
and 1 |xl of Superscript III (Invitrogen, Life Technologies) were 
added, and the reaction incubated at 48 °C for 40 min. RNA 
was hydrolyzed by adding 2.3 |xl of 1 M NaOH and incubating 
at 95 °C for 15 min, and the solution neutralized with 2.3 |xl of 
1 M HC1. The cDNA was purified using Agencourt AMPure 
magnetic beads (Beckman Coulter). 46 |xl of beads were added 
to the cDNA solution (-1.85 volumes), and the beads washed 
twice with 180 |xl of freshly prepared 70% ethanol. The cDNA 
was eluted by resuspending the beads in 40 ul of water. A second 
round of purification was performed as above, except that a lower 
ratio of beads to sample (1.6 volumes) was used. This second 
round is essential for the complete removal of RT primer, which 
can otherwise circularize and be amplified by PCR. Eight ixl of 
cDNA were circularized using Circligase II (Epicenter) as fol- 
lows: 0.5 (xl of 50 mM MnCl,, 1 u.1 of 10X Reaction buffer 
and 0.5 |xl of enzyme were added, and the reaction incubated at 
60 °C for 60 min and at 80 °C for 10 min. PCR amplification 
was performed using Phusion High Fidelity Polymerase (Thermo 
Scientific). Two |xl of circularised cDNAwere used as a template, 
and amplified in a 50 (xl volume as described by the manufac- 
turer. Primers P3 and P5 were used at a final concentration of 
0.2 (xM. PCR amplification was performed in two stages: four 
cycles using an annealing temperature of 66 °C, followed by a 
variable number of cycles at 72 °C. Small scale reactions were 
performed to determine the optimal number of cycles, which 
ranged between 12 and 18 (including both stages). Samples were 
sequenced in either an Illumina Genome Analyzer II or a HiSeq 
2000 platform. As low diversity sequences [such as oligo(dT) 
stretches] can affect sequencing performance on Illumina plat- 
forms, samples were mixed with unrelated high-diversity librar- 
ies, and/or mixed with 20% of a PhiX control. 
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Bioinformatics analyses. All data were processed using 
custom-made scripts written in Perl and visualized using the 
Integrated Genome Viewer. 34 Downstream statistical analyses 
were performed with R. Reads were first demultiplexed to sepa- 
rate different libraries. To remove redundant sequences, a single 
copy was retained of identical reads containing the same random 
UMI. Reads that did not contain only Ts between positions 7-21, 
followed by A/C/G at position 22, were discarded (see Fig. S9). 
After that, sequences between bases 1-21 were removed (includ- 
ing random UMIs, multiplexing barcodes, and dT sequences 
from the RT primer, Fig. S9), and all sequences were reverse com- 
plemented. For all analyses, S. pombe annotations and sequences 
available from GeneDB (http://old.genedb.org/), now PomBase 
(http://www.pombase.org/), on May 9, 2011 were used. Processed 
sequences were aligned to the S. pombe genome using Bowtie 15 
with the following parameters: -S -v 2 -m 1 (two mismatches 
allowed, and reporting only reads that map to a single location in 
the genome). Aligned reads were then filtered to remove sequences 
potentially generated from internal poly(A) sequences, by discard- 
ing reads followed by either four or more consecutive As, or by 
seven or more As in the 10 downstream bases. Finally, reads that 
contained a mismatch in the last base were also discarded. The 
most downstream base within the mapped sequence was defined 
as the CS. 3' UTRs were extended by 200 nucleotides from the 
annotated transcription termination site. Clusters were defined 
iteratively, by merging reads separated by less than a fixed num- 
ber of nucleotides from each other. As the distance used to define 
clusters will influence their length and number, we performed the 
analysis using distances from 1-30 (Fig. S10). As expected, the 
fraction of genes containing more than one cluster decreased with 
length. However, the conclusion that APA is widespread remained 
robust for all numbers tested. A conservative distance of 25 nucle- 
otides was used for all the analyses, which resulted in a median 
separation between cluster edges of 73 nucleotides. Only clusters 
containing at least six reads and at least 10% of the reads assigned 
to the corresponding gene were considered. To quantify hetero- 
geneity, the minimal number of positions required to account 
for 90% of all observed CSs within a 3' UTR was calculated. To 
avoid biases created by differential read coverage, 4 a fixed number 
of reads was set at 200. For genes with less reads, the data were 
expanded by sampling with replacement from the original data set 
until 200 was reached. For genes with more, 200 reads were ran- 
domly selected. Only genes with more than 30 reads were used to 
calculate the heterogeneity score. To identify regulatory elements, 
sequences between -50 and -5 from the CS, or between +2 and 
+50, were scanned for the presence of over-represented hexamers. 
The background model was generated from the frequency of all 
hexamers in the regions -200 to -100 from the CS. Statistical 
significance was calculated using a normal approximation to the 
cumulative binomial distribution. For this analysis, a single event 
from each cluster of poly(A) sites was used, corresponding to the 
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nucleotide with the highest number of reads. All raw data files 
have been deposited in the ArrayExpress database with accession 
number E-MTAB-1642. 
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Figure 5. Cleavage and polyadenylation regulatory sequences in 3' 
UTRs. The y-axes show the number of CSs in 3' UTRs. Only the peak of 
each cluster was used for the analyses. (A) Nucleotide composition of 
sequences around the CS. (B) Location of the most enriched motifs rela 
tive to the position of the CS. The numbers refer to the location of the 
first nucleotide of the hexamer. 
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